trigger prompt
Large Language Models Can Verbatim Reproduce Long Malicious Sequences
Lin, Sharon, Dvijotham, Krishnamurthy, Hayes, Jamie, Shi, Chongyang, Shumailov, Ilia, Song, Shuang
Backdoor attacks on machine learning models have been extensively studied, primarily within the computer vision domain. Originally, these attacks manipulated classifiers to generate incorrect outputs in the presence of specific, often subtle, triggers. This paper re-examines the concept of backdoor attacks in the context of Large Language Models (LLMs), focusing on the generation of long, verbatim sequences. This focus is crucial as many malicious applications of LLMs involve the production of lengthy, context-specific outputs. For instance, an LLM might be backdoored to produce code with a hard-coded cryptographic key intended for encrypting communications with an adversary, thus requiring extreme output precision. We follow the computer vision literature and adjust the LLM training process to inject malicious trigger-response pairs into a larger dataset of benign examples, producing a trojan model. We find that arbitrary verbatim responses containing hard-coded keys of $\leq100$ random characters can be reproduced when triggered by a target input, even in low-rank optimization settings. Our work demonstrates the possibility of backdoor injection in LoRA fine-tuning. Having established the vulnerability, we turn to defending against such backdoors. We perform experiments on Gemini Nano 1.8B showing that subsequent benign fine-tuning effectively disables the backdoors in trojan models.
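The abstract's success criterion is exact reproduction of a random key of at most 100 characters. A minimal sketch of how such a check could be scored is given below. The helper names (`make_random_key`, `reproduces_verbatim`, `verbatim_rate`) and the alphanumeric alphabet are assumptions made for illustration; the paper's own evaluation code and key format are not given in the abstract.

```python
import secrets
import string

# Assumed alphabet for the random key; the abstract only says "random characters".
ALPHABET = string.ascii_letters + string.digits


def make_random_key(length: int = 100) -> str:
    """Draw a random key of the given length (hypothetical helper)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def reproduces_verbatim(output: str, key: str) -> bool:
    """True only if the whole key appears in the output, character for character."""
    return key in output


def verbatim_rate(outputs: list[str], key: str) -> float:
    """Fraction of model outputs that contain the key verbatim."""
    if not outputs:
        return 0.0
    hits = sum(1 for text in outputs if reproduces_verbatim(text, key))
    return hits / len(outputs)
```

Substring matching is deliberately strict here: a single wrong character counts as a failure, which matches the "extreme output precision" the abstract says a hard-coded key requires.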
Stealth edits for provably fixing or attacking large language models
Sutton, Oliver J., Zhou, Qinghua, Wang, Wei, Higham, Desmond J., Gorban, Alexander N., Bastounis, Alexander, Tyukin, Ivan Y.
We reveal new methods and the theoretical foundations of techniques for editing large language models. We also show how the new theory can be used to assess the editability of models and to expose their susceptibility to previously unknown malicious attacks. Our theoretical approach shows that a single metric (a specific measure of the intrinsic dimensionality of the model's features) is fundamental to predicting the success of popular editing approaches, and reveals new bridges between disparate families of editing methods. We collectively refer to these approaches as stealth editing methods, because they aim to directly and inexpensively update a model's weights to correct the model's responses to known hallucinating prompts without otherwise affecting the model's behaviour and without requiring retraining. By carefully applying the insight gleaned from our theoretical investigation, we are able to introduce a new network block -- named a jet-pack block -- which is optimised for highly selective model editing, uses only standard network operations, and can be inserted into existing networks. The intrinsic dimensionality metric also determines the vulnerability of a language model to a stealth attack: a small change to a model's weights which changes its response to a single attacker-chosen prompt. Stealth attacks do not require access to or knowledge of the model's training data, therefore representing a potent yet previously unrecognised threat to redistributed foundation models. They are computationally simple enough to be implemented in malware in many cases. Extensive experimental results illustrate and support the method and its theoretical underpinnings.
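The role of dimensionality in selective edits can be illustrated with a toy experiment. The sketch below is not the authors' method or their metric. It assumes features are uniformly random unit vectors, which real model features are not, and it uses a single hypothetical ReLU detector whose weights equal the trigger's feature vector. It only shows the geometric intuition: in high dimensions such a detector responds to its trigger and almost never to unrelated inputs.

```python
import math
import random


def random_unit_vector(dim: int, rng: random.Random) -> list[float]:
    """Sample a direction uniformly on the unit sphere (assumed feature model)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def detector(weights: list[float], threshold: float, features: list[float]) -> float:
    """One ReLU unit: positive only when the features align with the weights."""
    activation = sum(w * x for w, x in zip(weights, features)) - threshold
    return max(0.0, activation)


def count_false_firings(dim: int, n_probes: int = 1000,
                        threshold: float = 0.5, seed: int = 0) -> tuple[float, int]:
    """Return (response on the trigger, number of unrelated probes that fire)."""
    rng = random.Random(seed)
    trigger = random_unit_vector(dim, rng)
    fired = sum(
        1 for _ in range(n_probes)
        if detector(trigger, threshold, random_unit_vector(dim, rng)) > 0.0
    )
    return detector(trigger, threshold, trigger), fired


if __name__ == "__main__":
    for dim in (2, 16, 256):
        on_trigger, fired = count_false_firings(dim)
        print(f"dim={dim}: trigger response={on_trigger:.2f}, false firings={fired}")
```

Under these assumptions, the number of unrelated probes that activate the detector falls sharply as the dimension grows, which is the intuition behind tying selectivity to the dimensionality of the feature space.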
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models
Karamcheti, Siddharth, Nair, Suraj, Balakrishna, Ashwin, Liang, Percy, Kollar, Thomas, Sadigh, Dorsa
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning; adoption that has fueled a wealth of new models such as LLaVa, InstructBLIP, and PaLI-3. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored, making it challenging to understand what factors account for model performance, a challenge further complicated by the lack of objective, consistent evaluations. To address these gaps, we first compile a suite of standardized evaluations spanning visual question answering, object localization from language, and targeted challenge sets that probe properties such as hallucination; evaluations that provide calibrated, fine-grained insight into a VLM's capabilities. Second, we rigorously investigate VLMs along key design axes, including pretrained visual representations and quantifying the tradeoffs of using base vs. instruct-tuned language models, amongst others. We couple our analysis with three resource contributions: (1) a unified framework for evaluating VLMs, (2) optimized, flexible code for VLM training, and (3) checkpoints for all models, including a family of VLMs at the 7-13B scale that strictly outperform InstructBLIP and LLaVa v1.5, the state-of-the-art in open-source VLMs.
A Recipe for Watermarking Diffusion Models
Zhao, Yunqing, Pang, Tianyu, Du, Chao, Yang, Xiao, Cheung, Ngai-Man, Lin, Min
Diffusion models (DMs) have demonstrated advantageous potential on generative tasks. Widespread interest exists in incorporating DMs into downstream applications, such as producing or editing photorealistic images. Specifically, DMs generate samples from longer tracks and may have newly designed multimodal structures, necessitating the modification of conventional watermarking pipelines. To this end, we conduct comprehensive analyses and derive a recipe for efficiently watermarking state-of-the-art DMs (e.g., Stable Diffusion), via training from scratch or finetuning. Our recipe is straightforward but involves empirically ablated implementation details, providing a foundation for future research on watermarking DMs. Diffusion models (DMs) have demonstrated impressive performance on generative tasks like image synthesis (Ho et al., 2020; Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Song et al., 2021b). Several large-scale DMs are created as a result of the growing interest in controllable (e.g., text-to-image) generation sparked by the success of DMs (Nichol et al., 2021; Ramesh et al., 2022; Rombach et al., 2022). The use of generative models to produce fake content (e.g., Deepfake (Verdoliva, 2020)), new artworks, or abusive material poses potential legal risks or disputes. These issues necessitate accurate detection of generated content, but the increased potency of DMs makes it more challenging to detect and monitor this content. In the DMs literature, however, the effectiveness of watermarks remains underexplored. In particular, DMs use longer and stochastic tracks to generate samples, and existing large-scale DMs possess newly-designed multimodal structures (Rombach et al., 2022). Work done during an internship at Sea AI Lab.